Retain underlying error on proxy mode connection failure #94998
elasticsearchmachine merged 20 commits into elastic:main
Conversation
When a remote cluster node fails to connect in sniff mode, the underlying error is preserved and reported. But in proxy mode, the underlying error is swallowed and it reports only a generic "Unable to open any proxy connections" error. This PR improves the error reporting in proxy mode to align with that of sniff mode. This allows consistent error messages in the API response and clearer diagnostics for underlying issues, e.g. authentication failure, incompatible license, etc.
Hi @ywangd, I've created a changelog YAML for you.
Pinging @elastic/es-security (Team:Security)
Pinging @elastic/es-distributed (Team:Distributed)
openConnections(finished, attemptNumber + 1);
if (attemptNumber >= MAX_CONNECT_ATTEMPTS_PER_RUN && connectionManager.size() == 0) {
    logger.warn(() -> "failed to open any proxy connections to cluster [" + clusterAlias + "]", e);
    finished.onFailure(new NoSeedNodeLeftException(strategyType(), clusterAlias, e));
} else {
    openConnections(finished, attemptNumber + 1);
This change basically tries to mirror how errors are handled in sniff mode, so that the underlying cause of the failure is retained and reported.
With this change, do we still need to check connectionManager.size() in the else branch below? I.e. can we remove this cause-free exception?
finished.onFailure(new NoSeedNodeLeftException(strategyType(), clusterAlias));
Or maybe we should have the attemptNumber >= MAX_CONNECT_ATTEMPTS_PER_RUN check in the onResponse handler above too.
With this change, do we still need to check connectionManager.size() in the else branch below?
In practice, we should not need it. But in theory, it is possible that the if (attemptNumber <= MAX_CONNECT_ATTEMPTS_PER_RUN) check fails in the first invocation and goes to the else branch.
We have a similar else branch in SniffConnectionStrategy as well. Therefore I didn't remove it.
Right, yes, but since this only affects the first invocation (and only if there are no seed nodes or some other weird config?) I think it'd be better to handle it up-front. That would clarify that once we get here we never return a cause-free NoSeedNodeLeftException.
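The up-front handling suggested here could look roughly like the following. This is a minimal, self-contained sketch with invented names (RetrySketch, openConnections taking a Supplier), not the actual ProxyConnectionStrategy code: the attempt-exhaustion check lives in one place, so the failure path always carries the underlying cause and a cause-free exception can never be thrown.

```java
import java.util.function.Supplier;

// Hypothetical sketch (not the real ProxyConnectionStrategy): retry opening
// connections, always reporting the last underlying failure as the cause.
public class RetrySketch {
    static final int MAX_ATTEMPTS = 3;

    // connector.get() returns true when a connection was opened, or throws.
    public static void openConnections(Supplier<Boolean> connector) throws Exception {
        RuntimeException lastFailure = null;
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                if (connector.get()) {
                    return; // opened at least one connection
                }
            } catch (RuntimeException e) {
                lastFailure = e;
            }
        }
        // attempts exhausted: surface the underlying failure as the cause
        throw new Exception("unable to open any proxy connections", lastFailure);
    }
}
```

With this shape there is no per-iteration else branch that could throw without a cause, which is the property the review comment asks for.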
}
logger.warn(() -> "fetching nodes from external cluster [" + clusterAlias + "] failed", e);
listener.onFailure(e);
listener.onFailure(new NoSeedNodeLeftException(strategyType(), clusterAlias, e));
For consistency between the two modes, I am also updating this to be wrapped within a NoSeedNodeLeftException, which makes the error message more explanatory in the REST response.
I think this only makes sense if seedNodesSuppliers.hasNext() == false. If we encountered some other more fundamental error then we didn't run out of seed nodes here.
Also IMO it would be good to eliminate the seedNodesSuppliers.hasNext() == false case entirely in this method so that there's always a cause to report.
I think this only makes sense if seedNodesSuppliers.hasNext() == false
Good catch! Will fix.
Also IMO it would be good to eliminate the seedNodesSuppliers.hasNext() == false case entirely in this method so that there's always a cause to report.
Sorry but I don't quite follow. There is no existing seedNodesSuppliers.hasNext() == false case. Are you referring to the else branch of the opening if (seedNodesSuppliers.hasNext()), i.e. related to the above comment? I think in practice the else branch never runs, so the cause is always reported. Please let me know if you'd like to drop the else branch for both the Sniff and Proxy strategies. I can make the change to both; I just want them to be consistent.
Yes I meant the else branch. I wonder if it could run if seedNodesSuppliers is empty. Not sure if this can happen but maybe it could in future even if not today? I'd like to protect against that at least, but again I think it's better to do so up-front rather than on every iteration.
expectThrows(Exception.class, connectFuture::actionGet);
final NoSeedNodeLeftException exception = (NoSeedNodeLeftException) expectThrows(
    ExecutionException.class,
    connectFuture::get
).getCause();
Sniff mode was throwing a ConnectTransportException directly. As commented above, I changed it to be wrapped inside a NoSeedNodeLeftException.
DaveCTurner
left a comment
Makes sense to me. I left a couple of inline thoughts. Moreover:
- I've seen cases where the last error wasn't the useful one - could we do more to report the other errors encountered along the way (maybe collecting them as suppressed exceptions)?
- If the inner exception is a ConnectTransportException then it will report the address to which it was trying to connect, but other exceptions will not include this and I think that makes further investigation much harder. Could we make sure that we include the target address always?
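The two suggestions above could be combined roughly as follows. This is a hedged sketch with invented names (FailureReportSketch, buildFailure, the message format), not the actual elasticsearch code: earlier per-attempt failures are attached as suppressed exceptions, and the target address always goes into the top-level message so it is reported even when the cause is not a ConnectTransportException.

```java
import java.util.List;

// Hypothetical sketch: build a single failure that carries the last error
// as the cause, the earlier errors as suppressed exceptions, and the target
// address in the top-level message.
public class FailureReportSketch {
    public static Exception buildFailure(String clusterAlias, String proxyAddress, List<Exception> attemptFailures) {
        Exception last = attemptFailures.isEmpty() ? null : attemptFailures.get(attemptFailures.size() - 1);
        Exception top = new Exception(
            "unable to open any proxy connections to cluster [" + clusterAlias + "] at [" + proxyAddress + "]",
            last
        );
        // keep the earlier failures too, so the last error is not the only one reported
        for (int i = 0; i < attemptFailures.size() - 1; i++) {
            top.addSuppressed(attemptFailures.get(i));
        }
        return top;
    }
}
```

Java's Throwable.addSuppressed makes the extra errors show up automatically in stack traces, so no custom rendering is needed.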
server/src/main/java/org/elasticsearch/transport/NoSeedNodeLeftException.java
@DaveCTurner I pushed an update that should address your comments. Could you please take another look to see whether the changes match your suggestion? Thanks!
.put("cluster.remote.invalid_remote.mode", "proxy")
.put("cluster.remote.invalid_remote.proxy_address", fulfillingCluster.getRemoteClusterServerEndpoint(0))
.build()
randomBoolean() && false
The && false is maybe a leftover from a debugging session?
...g/elasticsearch/xpack/remotecluster/RemoteClusterSecurityLicensingAndFeatureUsageRestIT.java
if (seedNodesSuppliers.hasNext()) {
    listener.onFailure(getNoSeedNodeLeftException(Set.of(e)));
} else {
    listener.onFailure(e);
Shouldn't this be flipped? I.e., if there are no seed nodes left, we want a NoSeedNodeLeftException or no?
Went back to the original thread and I see the intention now: if we hit an exception that we think we can't recover from by moving on to a different seed node, we don't want to wrap in NoSeedNodeLeftException here.
Given that, the behavior is still confusing though: suppose we have two seed nodes, and we fail with a ConnectTransportException on the first one, then with a "terminating" exception on the second node. In that case, we don't wrap the exception. However, if we had three seed nodes, and failed on the second with a "terminating" exception, we would wrap.
Sorry if I'm missing something here! If I am, it would be great to include a comment in the code with a little more context.
As a side note: not wrapping on "terminating" exceptions is what seems to produce the inconsistency in status codes between proxy mode and sniff mode (e.g., for the license failures, we get a 500 in proxy mode, but a 403 in sniff mode). I'm wondering if we should introduce a second class of failure that still produces a 500 for the case where we don't want to return a NoSeedNodeLeftException.
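The status-code concern can be illustrated with a small sketch. All names here are invented (HasStatus, LicenseFailure, NoConnectionsWrapper), not the real elasticsearch exception hierarchy, which resolves statuses its own way: a wrapper that reports a fixed 500 hides the status of its cause, while walking the cause chain keeps e.g. a 403 license failure visible through the wrapper.

```java
// Hypothetical sketch: resolve an HTTP-like status by walking the cause
// chain, falling back to 500 only when nothing in the chain carries one.
public class StatusSketch {
    interface HasStatus {
        int status();
    }

    static class LicenseFailure extends RuntimeException implements HasStatus {
        public int status() {
            return 403; // forbidden
        }
    }

    static class NoConnectionsWrapper extends RuntimeException {
        NoConnectionsWrapper(Throwable cause) {
            super(cause);
        }
    }

    public static int statusOf(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof HasStatus withStatus) {
                return withStatus.status();
            }
        }
        return 500; // last resort: internal server error
    }
}
```

Under this scheme, wrapping would no longer change the reported status, which would remove the 500-vs-403 inconsistency between the two modes.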
Just pushed an update to revert some changes to SniffConnectionStrategy. I think it's better to focus on proxy mode in this PR, as suggested in the title.
@DaveCTurner @n1v0lg Friendly reminder for review at your convenience. As commented above, I have decided to take a minimal approach with this PR to focus on its primary goal. Happy to work on follow-ups once this one is through.
n1v0lg
left a comment
LGTM. Sorry for the delay!
From my end, I'm good with only updating proxy mode, and leaving out the sniff mode improvements. Likewise, good with not further refactoring proxy connection strategy in this PR.
    }
}

private NoSeedNodeLeftException getNoSeedNodeLeftException(Collection<Exception> suppressedExceptions) {
Since we're only ever passing an empty set here, we could just remove the suppressedExceptions arg altogether.
Since I have reduced the scope of this PR, this method is no longer necessary. I inlined it, which also removes the need for suppressedExceptions. Thanks for spotting it.
if (countDown.countDown()) {
    openConnections(finished, attemptNumber + 1);
    if (attemptNumber >= MAX_CONNECT_ATTEMPTS_PER_RUN && connectionManager.size() == 0) {
        logger.warn(() -> "failed to open any proxy connections to cluster [" + clusterAlias + "]", e);
Sanity checking that warn is the right log level: this could be caused by longer-lasting but still temporary network glitches, so I'm wondering if we want info or even debug here instead.
warn is good. info shouldn't really be used to report exceptions, and debug is hidden by default so will make troubleshooting too difficult. See these docs for more info.
cheers for the pointer, all good to keep this at warn here!
@elasticmachine update branch

@DaveCTurner Do you still plan to review the PR? I'd like to get it merged before 8.8 FF. Thanks!

@elasticmachine update branch

I'll be merging this PR. Happy to work on follow-ups as discussed.

Failure is #95514

@elasticmachine update branch

@elasticmachine update branch

@elasticmachine run elasticsearch-ci/docs-check
This PR removes a bogus assertion from ProxyConnectionStrategy. The statement tries to assert that all connection exceptions are caught in the if branch and that we should have at least one connection when the code reaches the else branch. This is simply not true, because checking whether we should open more connections and subsequently opening connections are *not* atomic. The number of connections can change in between, e.g. the remote end drops the connection, which makes the code flow go to the else branch and trips the bogus assertion. Relates: elastic#94998 Resolves: elastic#99113
Today, when the number of attempts is exhausted, ProxyConnectionStrategy checks the number of connections before returning. It reports connection failure if the number of connections is zero at the time of checking. However, this behaviour is incorrect. In rare cases, a connection can be dropped right after it is initially established and before the count is checked. From the perspective of the `openConnections` method, it should not care whether or when opened connections are subsequently closed. As long as connections have been initially established, it should report success instead of failure. This PR adjusts the code to report success in the above situation. Relates: #94998 Resolves: #99113
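The race described above can be illustrated with a minimal sketch (invented names, not the actual ProxyConnectionStrategy): recording "ever opened" makes the success/failure decision independent of whether the remote end drops the connection before the final check runs.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: track whether any connection was ever established
// during this run, instead of re-checking the live connection count, which
// can drop to zero concurrently.
public class OpenRaceSketch {
    private final AtomicInteger liveConnections = new AtomicInteger();
    private final AtomicBoolean everOpened = new AtomicBoolean();

    public void onConnectionOpened() {
        liveConnections.incrementAndGet();
        everOpened.set(true);
    }

    public void onConnectionClosed() {
        liveConnections.decrementAndGet();
    }

    // Old behavior: reports failure if the remote dropped the connection
    // before the check ran, even though a connection attempt succeeded.
    public boolean successByLiveCount() {
        return liveConnections.get() > 0;
    }

    // Fixed behavior: success once a connection was ever established.
    public boolean successByEverOpened() {
        return everOpened.get();
    }
}
```

The two predicates only disagree in the narrow window where a connection is opened and then dropped before the count check, which is exactly the case the assertion tripped on.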